Pathogen detection in RNA-seq data with Pathonoia

Background Bacterial and viral infections may cause or exacerbate various human diseases and to detect microbes in tissue, one method of choice is RNA sequencing. The detection of specific microbes using RNA sequencing offers good sensitivity and specificity, but untargeted approaches suffer from high false positive rates and a lack of sensitivity for lowly abundant organisms. Results We introduce Pathonoia, an algorithm that detects viruses and bacteria in RNA sequencing data with high precision and recall. Pathonoia first applies an established k-mer based method for species identification and then aggregates this evidence over all reads in a sample. In addition, we provide an easy-to-use analysis framework that highlights potential microbe-host interactions by correlating the microbial to the host gene expression. Pathonoia outperforms state-of-the-art methods in microbial detection specificity, both on in silico and real datasets. Conclusion Two case studies in human liver and brain show how Pathonoia can support novel hypotheses on microbial infection exacerbating disease. The Python package for Pathonoia sample analysis and a guided analysis Jupyter notebook for bulk RNAseq datasets are available on GitHub. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-023-05144-z.


SUPPLEMENTARY MATERIAL
Supporting Information -Pathonoia S1 Pathonoia Algorithm Details Pathonoia was developed for analyzing the adhering metagenome of an RNA-seq sample or dataset. The algorithm takes FASTA files as input, which are, in the first step, evaluated by the Kraken 2 [1] algorithm. Kraken 2 accepts also fastQ files and single-end, as well as paired-end read files. Pathonoia's code can be easily adjusted to accept those file types as well.
The Kraken 2 index used for Pathonoia and the results in the main manuscript was built in March 2019 using the following commands: ./kraken2-build --download-taxonomy --db db/bacvir k31 ./kraken2-build --download-library bacteria --db db/bacvir k31 --use-ftp ./kraken2-build --download-library viral --db db/bacvir k31 --use-ftp ./kraken2-build --build --db db/bacvir k31 --threads 8 --kmer-len 31 --minimizer-len 31 ./kraken2-build --clean --db db/bacvir k31 An updated index from May 2022, created with the same commands, is available for download, on https://oasis.ims.bio/kraken2 k31 2205.zip. We set the k-mer length equal to the maximal minimizer length in Kraken 2 k = l = 31. This forces Kraken 2 to use no optimization (in terms of memory usage) at the cost of precision. You may refer to the original Kraken 2 publication [1] for details. Since the relatively short k = 31 (default is k = 35) allows for too many random (incorrect) matches to the index, Pathonoia only considers sequences, where at least z > 4 k-mers in a row were classified identically (not mentioned in figure). We use z > 4 for matching the original k-mer default length of k = 35, where our index was built with k = 31.
The runtime of Pathonoia is majorly based on Kraken 2, as it takes the majority of runtime and memory of the algorithm. Pathonoia can be considered an "add-on" to Kraken 2 and scans the results in one pass. For example, a dataset of 125 samples (∼ 50GB of fastq.gz files) took 50 min processing time (for the add-on part alone, parallel processing turned off), including the merge of the results, after having run Kraken 2 for all samples within several hours on the same server. Kraken 2's runtime has been benchmarked previously by Ye et al., [5] and is in our experience greatly dependent on the available RAM to load the index, while the actual runtime with 32 threads is around 10 seconds for each sample. S1.1 Exceptions of discovered species, not discovered by Kraken Kraken 2 classifies each read with the Lowest Common Ancestor (LCA) of the species corresponding to all k-mers of this read. Pathonoia is based on the same species -k-mer alignments but abstracts from their read  "environment". It counts the k-mers of the same species for all k-mers (in all reads) of the same sample.
Therefore, it can theoretically be, that some species will not be reported by Kraken 2 that is discovered by Pathonoia. In that case the LCA classification algorithm of Kraken 2 is masking it.

S1.2 Recall, Precision and F1 Score for Benchmarked Algorithms
The following tables contain the underlying data for our benchmark and Figures 2D-E. They are recall, precision and F1 score for the seven simulated samples and their average per evaluated algorithm.

S2.1 Settings for associated tools
For the analysis of our example datasets, we used the STAR aligner [3] for retrieving aligned (and unaligned) reads using the following settings: . In the final step of the downstream analysis, we use WebGestalt for the functional enrichment analysis of the (human) gene sets. We use the R version of the tool for the analysis of biological processes and molecular functions, with the following settings: WebGes BioProc <-function(genes, projectname) WebGestaltR(enrichMethod="ORA", organism="hsapiens", enrichDatabase="geneontology Biological Process", interestGene=genes,interestGeneType="ensembl gene id", referenceGeneType="genesymbol", referenceSet="genome protein-coding", minNum=5, maxNum=2000, fdrMethod="BH", sigMethod="fdr", fdrThr=0.05, topThr=10, reportNum=20, perNum=1000, nThreads=64, isOutput=TRUE, outputDirectory="GO results/GO Analysis biolProcess FDR05", projectName=projectname, dagColor="continuous", hostName="http://www.webgestalt.org/") WebGes MolFun <-function(genes, projectname) WebGestaltR(enrichMethod="ORA", organism="hsapiens", enrichDatabase="geneontology Molecular Function", interestGene=genes,interestGeneType="ensembl gene id",referenceGeneType="genesymbol", referenceSet="genome protein-coding", minNum=5, maxNum=2000, fdrMethod="BH",sigMethod="fdr", fdrThr=0.05,topThr=10, reportNum=20, perNum=1000, nThreads=64, isOutput=TRUE,    Table ST4 Using SEAweb with the search terms "brain" and "Burkholderia Pseudomallei" (closest to B.Stabilis as the latter was not found in SEA datasets), we find significant differential abundance of B. Pseudomallei in two independent datasets concerning neurological diseases comparing patient samples and control.  TMEM39B SLC9A1  UBE2E2  PHACTR1  OPTN  MIR548H3  IL1RL1  SSU72  LRFN3  CORO2B  BAAT  AKNAD1  SIRPA  RN7SL32P  IL18R1  GSK3A  NEFM  SKIL  LARP1  PRRC2B  CD24  DOCK7  ATP1A1  SCAMP5  SALRNA1  KCNN1  NOMO1  PRDM2 HNRNPUL2 KCNC3 DYNLL2 LZTS3 GNG13 HAS1 Figure SF8 Top 9 differentially expressed organisms. The Pathonoia abundance for the nine most significant differentially expressed organisms between control and diseased samples is shown. Values are given for all disease phenotypes separately. Mostly the mean (blue bar) and median (red bar) abundance per sample is zero since presence of specific organisms is rare. Burkholderia stabilis is the most present organism in seven diseased samples. Even though the analysis was done comparing diseased and control samples, this plot shows the four different phenotypes of FTD. Here it can be seen how big values in single samples can lead to significant group differences. Nevertheless, some of the organisms show consistent increase in the diseased samples but not in the controls.  Figure SF9 Diseased samples with and without B.stabilis were analyzed for differential gene expression, for understanding how it might be involved in the disease. A 143 genes were discovered as significantly differentially expressed with an adjusted p-value < 0.05 (109 down regulated, 34 up regulated). Two down regulated genes show very high log2 fold change combined with an extremely low p-value. These genes are AC012184.2 and FP236383.9. B Their expression values (TPM) are given in the individual samples. AC012184.2 is expressed in choroid plexus in the brain according to the Ensemble expression atlas [6]. According to its GeneCard [7] it produces a novel, uncharacterized protein and is involved in ATP binding. For FP236383.9 much less information can be found. It was manually annotated by the Sanger Institute havana project [8], is uncharacterized and to be experimentally confirmed (TEC). According to MirTarBase [9] though, the miRNA hsa-mir-204-5p is targeting this gene. Using SEA [10], it can be found that this miRNA is associated with retinal degeneration [11] and that the highest expression of this miRNA is found in eye and brain areas. Furthermore, various brain regions show up in the differential expression analysis results with high significance for this miRNA. 28.86658 1.31E-10 2 Figure SF10 Top 12 differentially expressed organisms. The Pathonoia abundance for the nine most significant differentially expressed organisms between fibrosis and no fibrosis samples is shown. Values are given for all fibrosis levels separately. Mostly the median (red bar) abundance per sample is zero since presence of specific organisms is rare. Even though the analysis was done comparing fibrosis vs no-fibrosis samples, this plot shows all fibrosis levels. Only Burkholderia sp. OLGA172 shows abundance in non-fibrotic samples, but only two samples contain this pathogen.